Abstract. We show that various formulations (e.g., dual and Kullback-Csiszár
iterations) of estimation of maximum entropy (ME) models can be transformed to
solving systems of polynomial equations in several variables, for which one can use
the celebrated Gröbner bases methods. Posing ME estimation as solving polynomial
equations is possible in the cases where the feature functions (sufficient statistics) that
provide the information about the underlying random variable in the form of
expectations are integer valued.

1. Introduction

Algebra has always played an important role in statistics, a classical example being
linear algebra. There are also many other instances of applying algebraic tools in
statistics (e.g., (Viana & Richards 2001)). However, treating statistical models as algebraic
objects, and thereby using tools of computational commutative algebra and algebraic
geometry in the analysis of statistical models, is very recent and has led to the still
evolving field of algebraic statistics.

The use of computational algebra and algebraic geometry in statistics was initiated
in the work of Diaconis and Sturmfels (Diaconis & Sturmfels 1998) on exact hypothesis
tests of conditional independence in contingency tables, and in the work of Pistone et
al. (Pistone et al. 2001) on experimental design. The term 'Algebraic Statistics' was first
coined in the monograph by Pistone et al. (Pistone et al. 2001) and appeared recently
in the title of the book by Pachter and Sturmfels (Pachter & Sturmfels 2005).

To extract the underlying algebraic structures in discrete statistical models,
algebraic statistics treats statistical models as affine varieties. (An affine variety is the
set of all solutions to a family of polynomial equations.) Parametric statistical models are
described in terms of a polynomial (or rational) mapping from a set of parameters to
distributions. One can show that many statistical models, for example independence
models, Bernoulli random variables, etc. (see (Pachter & Sturmfels 2005) for more
examples), can be given this algebraic formulation, and these are referred to as algebraic
statistical models.

Exponential models, which form an important class of statistical models, are
studied in algebraic statistics under the name 'toric' models by using maximum
likelihood methods. Toric models are algebraic statistical models, and the term 'toric'
comes from important algebraic objects known as 'toric ideals' in computational algebra.
In view of the well-established role of information theory in statistics (Kullback 1959,
Csiszár & Shields 2004), this paper attempts to describe maximum entropy models in
the algebraic statistical framework.

In particular, we show that maximum entropy models (and also minimum
relative-entropy models) are indeed toric models when the functions that provide the
information about the underlying random variable in the form of expected values are
integer valued. We also show that when the information is available in the form of sample
means, modifying the maximum entropy prescriptions reduces the calculation of model
parameters to solving a set of polynomial equations. This establishes that the set of
statistical models resulting from maximum entropy methods is indeed an algebraic variety.

A note on the results presented in this paper: due to space constraints, we do not
present the details of Gröbner bases theory and related concepts for solving polynomial
equations; we refer the reader to textbooks on computational algebra and Gröbner
bases theory (Adams & Loustaunau 1994, Cox et al. 1991).

We organize our paper as follows. In § 2 we give basic notions of algebra and
introduce notation along with an introduction to algebraic statistics. § 3 describes
maximum entropy (ME) prescriptions in the algebraic statistical framework by introducing
important algebraic objects called toric ideals. In § 4 we show how one can transform
the problem of calculating ME distributions to solving a set of polynomial equations.

2. Algebraic Statistical Models 

2.1. Basic notions of Algebra 

Throughout this paper $k$ represents a field. A monomial in $n$ indeterminates $x_1, \ldots, x_n$
is a power product of the form $x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, where all the exponents are nonnegative
integers, i.e., $\alpha_i \in \mathbb{Z}_{\geq 0}$, $i = 1, \ldots, n$. One can simplify the notation for monomials as
follows: denote $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{Z}^n_{\geq 0}$ and, using multi-index notation, set

$$x^{\alpha} = x_1^{\alpha_1} \cdots x_n^{\alpha_n},$$

with the understanding that $x = (x_1, \ldots, x_n)$. Note that $x^{\alpha} = 1$ whenever $\alpha =
(0, \ldots, 0)$. Once the order of the indeterminates is fixed, the monomial $x_1^{\alpha_1} \cdots x_n^{\alpha_n} = x^{\alpha}$ is
identified by $(\alpha_1, \ldots, \alpha_n)$. Hence, the set of all monomials in the indeterminates $x_1, \ldots, x_n$
can be represented by $\mathbb{Z}^n_{\geq 0}$. The theory of monomials is central to the celebrated
Gröbner bases theory in computational algebra, which provides tools for solving
sets of polynomial equations and related problems in algebraic geometry (Mishra &
Yap 1989). Monomial theory itself plays an important role in algebraic statistics in the
representation of exponential models, where probabilities are expressed in terms of power
products (Rapallo 2006).

A polynomial $f$ in $x_1, \ldots, x_n$ with coefficients in $k$ is a finite linear combination of
monomials and can be written in the form

$$f = \sum_{\alpha \in \Lambda_f} a_{\alpha} x^{\alpha},$$

where $\Lambda_f \subset \mathbb{Z}^n_{\geq 0}$ is a finite set and $a_{\alpha} \in k$. The collection of all polynomials in the
indeterminates $x_1, \ldots, x_n$ is the set $k[x_1, \ldots, x_n]$, and it has the structure not only of a
vector space but also of a ring. Indeed, the ring structure of $k[x_1, \ldots, x_n]$ plays the main
role in computational algebra and algebraic geometry.

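A subset $\mathfrak{a} \subseteq k[x_1, \ldots, x_n]$ is said to be an ideal if it satisfies: (i) $0 \in \mathfrak{a}$; (ii) if $f, g \in \mathfrak{a}$,
then $f + g \in \mathfrak{a}$; (iii) if $f \in \mathfrak{a}$ and $h \in k[x_1, \ldots, x_n]$, then $hf \in \mathfrak{a}$. A set $V \subseteq k^n$ is said
to be an affine variety if there exist $f_1, \ldots, f_s \in k[x_1, \ldots, x_n]$ such that

$$V = \{(c_1, \ldots, c_n) \in k^n : f_i(c_1, \ldots, c_n) = 0,\ 1 \leq i \leq s\}.$$

We use the notation $\mathcal{V}(f_1, \ldots, f_s) = V$.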
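
The definitions above can be made concrete with a small sketch. It stores a polynomial as a dictionary from exponent tuples $\alpha$ to coefficients $a_{\alpha}$ and tests whether a point lies in $\mathcal{V}(f_1, \ldots, f_s)$. The function names and the two example polynomials are illustrative assumptions, not part of the paper; exact rational arithmetic is used so that the test for zero is reliable.

```python
from fractions import Fraction

def monomial(x, alpha):
    """Evaluate x^alpha = x_1^alpha_1 * ... * x_n^alpha_n for a multi-index alpha."""
    value = Fraction(1)
    for x_i, a_i in zip(x, alpha):
        value *= Fraction(x_i) ** a_i
    return value

def evaluate(poly, x):
    """Evaluate a polynomial stored as {exponent tuple: coefficient}."""
    return sum(coeff * monomial(x, alpha) for alpha, coeff in poly.items())

def in_variety(polys, point):
    """True if every polynomial in `polys` vanishes at `point`."""
    return all(evaluate(f, point) == 0 for f in polys)

# Hypothetical example in Q[x, y]: f1 = x^2 + y^2 - 1 and f2 = x - y.
f1 = {(2, 0): 1, (0, 2): 1, (0, 0): -1}
f2 = {(1, 0): 1, (0, 1): -1}
point = (Fraction(3, 5), Fraction(4, 5))

print(in_variety([f1], point))      # the point lies on the unit circle
print(in_variety([f1, f2], point))  # but not on the line x = y
```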

2.2. Algebraic Statistical Model 

At the very core of the field of algebraic statistics lies the notion of an 'algebraic
statistical model'. While this notion has the potential of serving as a unifying theme for
algebraic statistics, there is no unified definition of an algebraic statistical model (Drton
& Sullivant 2006). Here, we adopt the definition of a statistical model
from (Pachter & Sturmfels 2005, Drton & Sullivant 2006). For a recent and elaborate
discussion of the formal definition of algebraic statistical models one can refer to (Drton &
Sullivant 2006).

Let $X$ be a discrete random variable taking finitely many values from the set
$[m] = \{1, 2, \ldots, m\}$. A probability distribution $p$ of $X$ is naturally represented as a
vector $p = (p_1, \ldots, p_m) \in \mathbb{R}^m$ if we fix the order on $[m]$. The set of all probability mass
functions (pmfs) of $X$ is called the probability simplex

$$\Delta_{m-1} = \Big\{ p = (p_1, \ldots, p_m) \in \mathbb{R}^m_{\geq 0} : \sum_{i=1}^{m} p_i = 1 \Big\}. \tag{1}$$

The index $m - 1$ indicates the dimension of the simplex $\Delta_{m-1}$. A statistical model $\mathcal{M}$
is a subset of $\Delta_{m-1}$ and is said to be algebraic if there exist $f_1, \ldots, f_s \in k[p_1, \ldots, p_m]$ such that

$$\mathcal{M} = \mathcal{V}(f_1, \ldots, f_s) \cap \Delta_{m-1}.$$

Now we move on to parametric statistical models and their algebraic formulations. 

Let $\Theta \subseteq \mathbb{R}^d$ be a parameter space and $\kappa : \Theta \to \Delta_{m-1}$ be a map. The image
$\kappa(\Theta)$ is called a parametric statistical model. Given a statistical model $\mathcal{M} \subseteq \Delta_{m-1}$, by a
parametrization of $\mathcal{M}$ we mean identifying a set $\Theta \subseteq \mathbb{R}^d$ and a function $\kappa : \Theta \to \Delta_{m-1}$
such that $\mathcal{M} = \kappa(\Theta)$. To describe more general statistical models in the algebraic framework
we need the following notion of a semi-algebraic set.

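Definition 2.1. A set $\Theta \subseteq \mathbb{R}^d$ is called a semi-algebraic set if there are two finite
collections of polynomials $F \subset k[x_1, \ldots, x_d]$ and $G \subset k[x_1, \ldots, x_d]$ such that

$$\Theta = \{\theta \in \mathbb{R}^d : f(\theta) = 0\ \ \forall f \in F \ \text{and}\ g(\theta) > 0\ \ \forall g \in G\}.$$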

Now we have the following definition of a parametric algebraic statistical model.

Definition 2.2. Let $\Delta_{m-1}$ be a probability simplex and $\Theta \subseteq \mathbb{R}^d$ be a semi-algebraic
set. Let $\kappa : \mathbb{R}^d \to \mathbb{R}^m$ be a rational function (a rational function is a quotient of two
polynomials) such that $\kappa(\Theta) \subseteq \Delta_{m-1}$. Then the image $\mathcal{M} = \kappa(\Theta)$ is a parametric
algebraic statistical model.

Conversely, a parametric statistical model $\mathcal{M} = \kappa(\Theta) \subseteq \Delta_{m-1}$ is said to be algebraic
if $\Theta$ is a semi-algebraic set and $\kappa$ is a rational function. From now on we refer to
'parametric algebraic statistical models' as 'algebraic statistical models'.

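In this paper we consider the following special case of algebraic statistical models
(cf. (Pachter & Sturmfels 2005, p. 7)). Consider a map

$$\kappa : \Theta\ (\subseteq \mathbb{R}^d) \to \mathbb{R}^m,$$

$$\kappa : \theta = (\theta_1, \ldots, \theta_d) \mapsto (\kappa_1(\theta), \ldots, \kappa_m(\theta)), \tag{2}$$

where $\kappa_i \in k[\theta_1, \ldots, \theta_d]$. We assume that $\Theta$ satisfies $\kappa_i(\theta) > 0$, $i = 1, \ldots, m$, and
$\sum_{i=1}^{m} \kappa_i(\theta) = 1$ for any $\theta \in \Theta$. Under these conditions $\kappa(\Theta)$ is indeed an algebraic
statistical model (Definition 2.2), since $\kappa(\Theta) \subseteq \Delta_{m-1}$, $\kappa$ is a polynomial function, and
$\Theta$ is a semi-algebraic set ($F = \{\sum_{i=1}^{m} \kappa_i - 1\}$ and $G = \{\kappa_i : i = 1, \ldots, m\}$ in
Definition 2.1).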

Some statistical models are naturally given by a polynomial map $\kappa$ as in (2) for which
the condition $\sum_{i=1}^{m} \kappa_i(\theta) = 1$ does not hold. In this case one can consider the following
algebraic statistical model:

$$\kappa : \theta = (\theta_1, \ldots, \theta_d) \mapsto \frac{1}{\sum_{i=1}^{m} \kappa_i(\theta)} \big(\kappa_1(\theta), \ldots, \kappa_m(\theta)\big), \tag{3}$$

assuming that the remaining conditions specified for the model (2) are valid
here too. The only difference is that instead of $\kappa$ being a polynomial map, we have it
as a rational map.
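
As an illustration of the maps (2) and (3), the following sketch parametrizes the independence model of two binary random variables in two ways: by a polynomial map whose coordinates already sum to one, and by an unnormalized monomial map that is divided by its coordinate sum as in (3). Both parametrizations, and the chosen parameter values, are illustrative assumptions rather than examples taken from the paper. In either case the image lies in the simplex and on the variety $p_1 p_4 - p_2 p_3 = 0$.

```python
from fractions import Fraction

def kappa(theta):
    """Polynomial map as in (2): joint pmf of two independent binary variables."""
    a, b = theta
    return (a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b))

def kappa_normalized(theta):
    """Rational map as in (3): an unnormalized monomial map divided by its coordinate sum."""
    a, b = theta
    raw = (a * b, a, b, Fraction(1))   # coordinates do not sum to 1 in general
    total = sum(raw)
    return tuple(r / total for r in raw)

p = kappa((Fraction(1, 3), Fraction(1, 4)))
q = kappa_normalized((Fraction(2), Fraction(5)))

for dist in (p, q):
    on_simplex = sum(dist) == 1 and all(x > 0 for x in dist)
    on_variety = dist[0] * dist[3] - dist[1] * dist[2] == 0
    print(dist, on_simplex, on_variety)
```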

3. ME in algebraic statistical setup 

3.1. Toric Models 

In the algebraic description of exponential models, monomials and binomials play a
fundamental role. The study of relations among power products leads to the theory of
toric ideals in commutative algebra (Sturmfels 1996). Here we describe the basic
notions of toric ideals that are relevant to the representation and computation of discrete
exponential models; for more details on the theory and computation of toric ideals one can
refer to (Sturmfels 1996, Bigatti & Robbiano 2001, Bigatti et al. 1999).

Before we give the definition of a toric ideal, we describe the notion of a Laurent
polynomial. If we allow negative exponents in a polynomial, i.e., a polynomial of the
form $f = \sum_{\alpha \in \Lambda_f} a_{\alpha} x^{\alpha}$ where $\alpha \in \mathbb{Z}^n$, it is known as a Laurent polynomial ($\Lambda_f \subset \mathbb{Z}^n$ is
finite). The set of all Laurent polynomials in the indeterminates $x_1, \ldots, x_n$ is denoted by
$k[x_1^{\pm 1}, \ldots, x_n^{\pm 1}]$, and it also has the structure of a ring.

Now we define the toric ideal. 

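Definition 3.1. Let $A = [a_{ij}] \in \mathbb{Z}^{d \times n}$ be a matrix with rank $d$. Consider the ring
homomorphism

$$\hat{\pi} : k[x_1, \ldots, x_n] \to k[\theta_1^{\pm 1}, \ldots, \theta_d^{\pm 1}],$$

$$\hat{\pi} : x_j \mapsto \theta_1^{a_{1j}} \cdots \theta_d^{a_{dj}}. \tag{4}$$

The toric ideal $\mathfrak{a}_A$ of $A$ is defined as the kernel of the map $\hat{\pi}$, i.e., $\mathfrak{a}_A = \ker \hat{\pi}$.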

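The mapping $\hat{\pi}$ can be viewed as a "parametrization", which can be explained
by the following description. Consider the map

$$\pi : \mathbb{Z}^n_{\geq 0} \to \mathbb{Z}^d,$$

$$\pi : u = (u_1, \ldots, u_n) \mapsto Au. \tag{5}$$

The map $\pi$ lifts to the ring homomorphism $\hat{\pi}$ in the sense of the action of $\hat{\pi}$ on $x^u =
x_1^{u_1} \cdots x_n^{u_n} \in k[x_1, \ldots, x_n]$. That is,

$$\hat{\pi}(x^u) = \hat{\pi}(x_1^{u_1} \cdots x_n^{u_n}) = \Big(\prod_{i=1}^{d} \theta_i^{a_{i1}}\Big)^{u_1} \cdots \Big(\prod_{i=1}^{d} \theta_i^{a_{in}}\Big)^{u_n}. \tag{6}$$
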
Toric ideal theory plays an important role in applications of computational algebraic
geometry such as integer programming (cf. (Sturmfels 1996)). Note that in the
algebraic descriptions of exponential models and their maximum likelihood estimates
only the nonnegative case of toric ideals (and hence toric models) is considered, i.e., the
matrix $A = [a_{ij}]$ in Definition 3.1 is assumed to be nonnegative and the map (4)
is specified as $\hat{\pi} : k[x_1, \ldots, x_n] \to k[\theta_1, \ldots, \theta_d]$ (see (Pachter & Sturmfels 2005)).
As described later in this paper, in the algebraic descriptions of maximum entropy
models one has to deal with Laurent polynomials, and hence one has to include the
negative case in the definitions of toric ideals and toric models. This poses no problem,
because toric ideal theory in commutative algebra naturally includes the negative case
(as in Definition 3.1) and Gröbner bases theory can be extended to Laurent polynomial
rings (Pauer & Unterkircher 1999).
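
A binomial $x^u - x^v$ is sent by the monomial map (4) to $\theta^{Au} - \theta^{Av}$, so it lies in the toric ideal exactly when $Au = Av$. The sketch below checks this condition with the map (5) for a small matrix; the matrix and the function names are hypothetical choices made only for illustration.

```python
def pi(A, u):
    """Map (5): u -> A u, for an integer matrix A given as a list of rows."""
    return tuple(sum(a_ij * u_j for a_ij, u_j in zip(row, u)) for row in A)

def binomial_in_toric_ideal(A, u, v):
    """x^u - x^v lies in the kernel of the monomial map (4) exactly when A u = A v."""
    return pi(A, u) == pi(A, v)

# Hypothetical 2 x 3 matrix of rank 2; column j holds the exponents of the image of x_j.
A = [[1, 1, 1],
     [0, 1, 2]]

print(binomial_in_toric_ideal(A, (1, 0, 1), (0, 2, 0)))  # x1*x3 - x2^2
print(binomial_in_toric_ideal(A, (1, 0, 0), (0, 1, 0)))  # x1 - x2
```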

The concept of toric ideals led to the description of exponential models under the
name toric models in algebraic statistics, which are defined as follows.

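Definition 3.2. Let $A = [a_{ij}] \in \mathbb{Z}^{d \times m}_{\geq 0}$ be a matrix such that the vector $(1, \ldots, 1) \in \mathbb{Z}^m_{\geq 0}$ is in
the row span of $A$. Let $h \in \mathbb{R}^m_{>0}$ be a vector of positive real numbers. Let $\Theta = \mathbb{R}^d_{>0}$ and
let $\kappa^{A,h}$ be the rational parametrization

$$\kappa_j^{A,h} : \theta \mapsto Z(\theta)^{-1} h_j \prod_{i=1}^{d} \theta_i^{a_{ij}}, \tag{8}$$

where $\theta = (\theta_1, \ldots, \theta_d)$ and $Z(\theta)$ is the appropriate normalizing constant. The toric
model is the parametric algebraic statistical model

$$\mathcal{M}_{A,h} = \kappa^{A,h}(\Theta). \tag{9}$$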

Independence models, exponential models, Markov chains and hidden Markov
chains can be given an algebraic statistical description by means of toric models (Pachter
& Sturmfels 2005). We keep the positivity of $A$ in Definition 3.2 as a matter of
convention.

3.2. ME in terms of Toric Models 

Let $X$ be a random variable taking values from the set $[m] = \{1, 2, \ldots, m\}$. The only
information we know about the pmf $p = (p_1, \ldots, p_m)$ of $X$ is in the form of expected
values of the functions $t_i : [m] \to \mathbb{R}$, $i = 1, \ldots, d$ (we refer to these functions as 'constraint
functions'). We therefore have

$$\sum_{j=1}^{m} t_i(j)\, p_j = T_i, \quad i = 1, \ldots, d, \tag{10}$$

where $T_i$, $i = 1, \ldots, d$, are assumed to be known. In an information theoretic approach
to statistics, known as Jaynes' maximum entropy model, one would choose the pmf
$p \in \Delta_{m-1}$ that maximizes the Shannon entropy functional

$$S(p) = -\sum_{j=1}^{m} p_j \ln p_j \tag{11}$$

with respect to the constraints (10).

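The corresponding Lagrangian can be written as

$$\Xi(p, \xi) \equiv S(p) - \xi_0 \Big(\sum_{j=1}^{m} p_j - 1\Big) - \sum_{i=1}^{d} \xi_i \Big(\sum_{j=1}^{m} t_i(j)\, p_j - T_i\Big). \tag{12}$$

Holding $\xi = (\xi_1, \ldots, \xi_d)$ fixed, the unconstrained maximum of the Lagrangian $\Xi(p, \xi)$ over
all $p \in \Delta_{m-1}$ is given by an exponential family (Cover & Thomas 1991)

$$p_j(\xi) = Z(\xi)^{-1} \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big), \quad j = 1, \ldots, m, \tag{13}$$

where $Z(\xi)$ is the normalizing constant given by

$$Z(\xi) = \sum_{j=1}^{m} \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big). \tag{14}$$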

For various values of $\xi \in \mathbb{R}^d$, the family (13) is known as the maximum entropy model.
Now, we have the following proposition.

Proposition 3.3. The maximum entropy model (13) is a toric model provided that the
constraint functions are integer valued.

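Proof. Set $\xi_i = -\ln \theta_i$, $i = 1, \ldots, d$. Now, (13) gives us

$$p_j = Z(\xi)^{-1} \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big) = Z(\theta)^{-1} \prod_{i=1}^{d} \theta_i^{t_i(j)}. \tag{15}$$

By defining the matrix $A = [t_i(j)] \in \mathbb{Z}^{d \times m}$ and setting $h = (\frac{1}{m}, \ldots, \frac{1}{m})$ we have a rational
parametrization as in (8). □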
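
The substitution used in the proof can be checked numerically. The sketch below evaluates the exponential form (13) and the monomial form (8) with $\theta_i = \exp(-\xi_i)$ and a uniform $h$, and prints the largest difference between the two distributions. The constraint functions, the value of $\xi$ and the function names are hypothetical and serve only as an illustration.

```python
import math

def me_distribution(xi, t):
    """Exponential form (13): p_j proportional to exp(-sum_i xi_i * t_i(j))."""
    m = len(t[0])
    w = [math.exp(-sum(x * t_i[j] for x, t_i in zip(xi, t))) for j in range(m)]
    z = sum(w)
    return [w_j / z for w_j in w]

def toric_distribution(theta, A, h):
    """Monomial form (8): p_j proportional to h_j * prod_i theta_i^(a_ij)."""
    m = len(h)
    w = [h[j] * math.prod(th ** row[j] for th, row in zip(theta, A)) for j in range(m)]
    z = sum(w)
    return [w_j / z for w_j in w]

# Hypothetical integer-valued constraint functions on [m] = {1, 2, 3, 4}.
t = [[1, 2, 3, 4],      # t_1(j) = j
     [1, 0, 0, 1]]      # t_2(j) = 1 at the endpoints, 0 otherwise
xi = [0.3, -0.7]
theta = [math.exp(-x) for x in xi]   # substitution xi_i = -ln(theta_i)
h = [1 / 4] * 4                      # uniform h

p_exp = me_distribution(xi, t)
p_toric = toric_distribution(theta, t, h)
print(max(abs(a - b) for a, b in zip(p_exp, p_toric)))
```

The matrix $A$ passed to the monomial form is simply the table of constraint-function values, which is why integer-valued $t_i$ are needed for the exponents to be integers.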

Note that we allowed only integer valued functions in the ME-model in the above
proposition, which is necessary for the algebraic description of the same. Here we also
mention that in the above proof, by assuming $h \in \Delta_{m-1}$ (which acts as a prior), we can
conclude that the minimum I-divergence model (Csiszár 1975)

$$p_j = Z(\xi)^{-1} h_j \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big), \quad j = 1, \ldots, m, \tag{16}$$

(with the appropriate normalizing constant $Z(\xi)$) is indeed a toric model.

Once the specification of the statistical model is done, the task is to calculate the model
parameters from the available information. In this case the available information is in
the form of expected values of the functions $t_i$, $i = 1, \ldots, d$, and the Lagrange parameters $\xi_i$,
$i = 1, \ldots, d$, are determined using the constraints (10).



4. Calculation of ME distributions via solving Polynomial equations 

4.1. Direct method 

One can show that the Lagrange parameters in the ME-model (13) can be estimated by
solving the following set of partial differential equations (Jaynes 1968)

$$-\frac{\partial}{\partial \xi_i} \ln Z(\xi) = T_i, \quad i = 1, \ldots, d, \tag{17}$$

which in general has no explicit analytical solution. In the literature there are several methods
of estimating ME-models. One of the important methods is Darroch and Ratcliff's
generalized iterative scaling algorithm (Darroch & Ratcliff 1972). Here we show
that ME-models can be calculated using computational algebraic methods.

Note that the set of all distributions which satisfy (10) is known as a linear family (we
denote this by $\mathcal{L}$). Now, if we represent the exponential family (13) by $\mathcal{E}$, the set of
statistical models that results from the ME-principle can be written as $\mathcal{L} \cap \mathcal{E} \subset \Delta_{m-1}$. One
can show that $\mathcal{L} \cap \mathcal{E} \subset \Delta_{m-1}$ is a variety.

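By substituting the maximum entropy distributions (15) in (10) we get

$$\sum_{j=1}^{m} t_i(j)\, Z(\theta)^{-1} \prod_{k=1}^{d} \theta_k^{t_k(j)} = T_i, \quad i = 1, \ldots, d, \tag{18}$$

which can be written as

$$\sum_{j=1}^{m} t_i(j) \prod_{k=1}^{d} \theta_k^{t_k(j)} = T_i \sum_{j=1}^{m} \prod_{k=1}^{d} \theta_k^{t_k(j)}, \quad i = 1, \ldots, d. \tag{19}$$

The solutions of the system of polynomial equations (19) give the maximum entropy model
specified by the available information (10). We state this as a proposition.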

Proposition 4.1. The maximum entropy model (13) can be specified by solving a set of
polynomial equations provided that the constraint functions $t_i$, $i = 1, \ldots, d$, are integer
valued.
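
For a single constraint ($d = 1$), the system (19) reduces to one polynomial equation $\sum_j (t(j) - T)\, \theta^{t(j)} = 0$ in one indeterminate. The sketch below builds this polynomial for a six-sided die with $t(j) = j$ and a prescribed mean, and locates its positive root by bisection. The data are hypothetical, and plain bisection is used here only as a stand-in for the Gröbner bases computations that the multivariate case would call for.

```python
from fractions import Fraction

def constraint_polynomial(t, T):
    """Coefficients {exponent: coeff} of sum_j (t(j) - T) * theta^t(j), i.e. (19) with d = 1."""
    poly = {}
    for t_j in t:
        poly[t_j] = poly.get(t_j, 0) + (t_j - T)
    return poly

def positive_root(poly, lo=1e-6, hi=1e6, iters=200):
    """Bisection on a logarithmic scale for the sign change of the polynomial on (lo, hi)."""
    f = lambda x: sum(float(c) * x ** e for e, c in poly.items())
    assert f(lo) < 0 < f(hi), "bracket must contain a sign change"
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

# Hypothetical data: a six-sided die with t(j) = j and prescribed mean T = 9/2.
t = [1, 2, 3, 4, 5, 6]
T = Fraction(9, 2)

theta = positive_root(constraint_polynomial(t, T))
weights = [theta ** t_j for t_j in t]
p = [w / sum(weights) for w in weights]          # ME distribution as in (15)
mean = sum(t_j * p_j for t_j, p_j in zip(t, p))
print(theta, mean)
```

Note that only the exponents need to be integers for (19) to be polynomial; the prescribed value $T$ enters through the coefficients.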

4.2. Dual Method 

Here we follow the method of the dual optimization problem. By using the Kuhn-Tucker
theorem we calculate the Lagrange parameters $\xi_i$, $i = 1, \ldots, d$, in (13) by optimizing the dual of
$\Xi(p, \xi)$. That is, the task is to find the $\xi$ which optimizes

$$\Psi(\xi) = \Xi(p(\xi), \xi). \tag{20}$$

Note that $\Psi(\xi)$ is nothing but the entropy of the ME-distribution (13). We have

$$\Psi(\xi) = \ln Z(\xi) + \sum_{i=1}^{d} \xi_i T_i. \tag{21}$$



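This can be written as

$$\Psi(\xi) = \ln \sum_{j=1}^{m} \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big) + \sum_{i=1}^{d} \xi_i T_i
= \ln \sum_{j=1}^{m} \exp\Big(\sum_{i=1}^{d} \xi_i \big(T_i - t_i(j)\big)\Big). \tag{22}$$

Now optimizing $\Psi(\xi)$ is equivalent to optimizing

$$\Psi'(\xi) = \sum_{j=1}^{m} \exp\Big(\sum_{i=1}^{d} \xi_i \big(T_i - t_i(j)\big)\Big). \tag{23}$$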
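By introducing $\xi_i = -\ln \theta_i$, $i = 1, \ldots, d$, we have

$$\Psi'(\theta) = \sum_{j=1}^{m} \prod_{i=1}^{d} \theta_i^{\,t_i(j) - T_i}. \tag{24}$$

The solution is given by solving the following set of equations

$$\frac{\partial \Psi'}{\partial \theta_i} = 0, \quad i = 1, \ldots, d. \tag{25}$$

Unfortunately, $\frac{\partial \Psi'}{\partial \theta_i} \in k[\theta_1^{\pm 1}, \ldots, \theta_d^{\pm 1}]$ only if $T_i \in \mathbb{Z}$. Now, we consider the case where the
expected values are available as sample means.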

In most practical problems the information in the form of expected values is
available via sample or empirical means. That is, given a sequence of observations
$O_1, \ldots, O_N$, the sample means $\tilde{T}_i$, $i = 1, \ldots, d$, with respect to the functions $t_i$,
$i = 1, \ldots, d$, are given by

$$\tilde{T}_i = \frac{1}{N} \sum_{l=1}^{N} t_i(O_l), \quad i = 1, \ldots, d, \tag{26}$$

and the underlying hypothesis is $T_i \approx \tilde{T}_i$. That is,

$$\sum_{j=1}^{m} p_j\, t_i(j) \approx \frac{1}{N} \sum_{l=1}^{N} t_i(O_l), \quad i = 1, \ldots, d. \tag{27}$$

Now we show that, by choosing an alternative Lagrangian in place of (12), we can
transform the parameter estimation of the ME-model to a problem of solving a set of
(Laurent) polynomial equations.

Proposition 4.2. Given the hypothesis (27), the problem of estimating the ME-model
in the dual method amounts to solving a set of Laurent polynomial equations (assuming
that the constraint functions are integer valued).

Proof. To retain integer valued exponents in our final solution we consider
constraints of the form

$$N \sum_{j=1}^{m} t_i(j)\, p_j = a_i, \quad i = 1, \ldots, d, \tag{28}$$

where $a_i = \sum_{l=1}^{N} t_i(O_l)$ denotes the sample sum. In this case the Lagrangian is

$$\Xi(p, \xi) \equiv S(p) - \xi_0 \Big(\sum_{j=1}^{m} p_j - 1\Big) - \sum_{i=1}^{d} \xi_i \Big(N \sum_{j=1}^{m} p_j\, t_i(j) - a_i\Big). \tag{29}$$

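This results in the ME-distribution

$$p_j(\xi) = Z(\xi)^{-1} \exp\Big(-N \sum_{i=1}^{d} \xi_i t_i(j)\Big), \quad j = 1, \ldots, m, \tag{30}$$

where $Z(\xi)$ is the normalizing constant given by

$$Z(\xi) = \sum_{j=1}^{m} \exp\Big(-N \sum_{i=1}^{d} \xi_i t_i(j)\Big). \tag{31}$$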

To calculate the parameters we optimize the dual $\Psi(\xi)$ of $\Xi(p, \xi)$. That is, we optimize
the functional

$$\Psi(\xi) = \ln Z(\xi) + \sum_{i=1}^{d} \xi_i a_i. \tag{32}$$

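It is equivalent to optimizing the functional

$$\Psi'(\xi) = \sum_{j=1}^{m} \exp\Big(-N \sum_{i=1}^{d} \xi_i t_i(j) + \sum_{i=1}^{d} \xi_i a_i\Big).$$

By setting $\ln \theta_i = \xi_i$ we have

$$\Psi'(\theta) = \sum_{j=1}^{m} \prod_{i=1}^{d} \theta_i^{\,a_i - N t_i(j)}. \tag{33}$$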

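The solution is given by solving the following set of equations

$$\frac{\partial \Psi'}{\partial \theta_i} = 0, \quad i = 1, \ldots, d. \tag{34}$$

Since every exponent $a_i - N t_i(j)$ is an integer, we have

$$\frac{\partial \Psi'}{\partial \theta_i} \in k[\theta_1^{\pm 1}, \ldots, \theta_d^{\pm 1}], \quad i = 1, \ldots, d. \tag{35}$$

□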
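
The construction in the proof can be illustrated for a single constraint. With $d = 1$, the derivative of (33) is a Laurent polynomial whose terms have coefficient $a - N t(j)$ and exponent $a - N t(j) - 1$. The sketch below lists these terms for a hypothetical sample of die rolls, finds the positive root by bisection, and checks that the resulting distribution (30) satisfies the constraint (28). The observations, bracket and function names are assumptions made for illustration, and bisection again stands in for the algebraic methods discussed in the text.

```python
def dual_gradient_terms(t, a, N):
    """Terms (coeff, exponent) of d/dtheta of sum_j theta^(a - N*t(j)), the d = 1 case of (33)."""
    return [(a - N * t_j, a - N * t_j - 1) for t_j in t]

def solve_positive(terms, lo, hi, iters=200):
    """Geometric bisection for a root of a Laurent polynomial on (lo, hi)."""
    g = lambda x: sum(c * x ** e for c, e in terms)
    assert g(lo) * g(hi) < 0, "bracket must contain a sign change"
    lo_is_negative = g(lo) < 0
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if (g(mid) < 0) == lo_is_negative:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

# Hypothetical observations of a six-sided die, with t(j) = j.
observations = [6, 5, 4, 3]
N, a = len(observations), sum(observations)    # integer sample sum a = 18
t = [1, 2, 3, 4, 5, 6]

terms = dual_gradient_terms(t, a, N)           # all exponents are integers, some negative
theta = solve_positive(terms, 0.5, 2.0)
weights = [theta ** (-N * t_j) for t_j in t]   # (30) with xi = ln(theta)
p = [w / sum(weights) for w in weights]
print(terms)
print(N * sum(t_j * p_j for t_j, p_j in zip(t, p)), a)
```

Working with the sample sum $a$ rather than the sample mean is what keeps the exponents integral, at the price of admitting negative exponents.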

In algebraic statistics, algebraic descriptions are used to analyze the maximum
likelihood estimates of exponential models (Pachter & Sturmfels 2005). Given
that maximum likelihood and maximum entropy are related, it will be interesting to
compare these two methods from the algebraic statistical point of view.



4.3. Kullback-Csiszár Iteration

The minimum I-divergence principle is a generalization of the maximum entropy principle
which considers the cases where a prior estimate of the distribution $p$ is available. Given
a prior estimate $r \in \Delta_{m-1}$ and information in the form of (10), one would choose the pmf
$p \in \Delta_{m-1}$ that minimizes the Kullback-Leibler divergence

$$I(p \| r) = \sum_{j=1}^{m} p_j \ln \frac{p_j}{r_j} \tag{36}$$

with respect to the constraints (10). The corresponding minimum entropy distributions
are of the form

$$p_j(\xi) = Z(\xi)^{-1}\, r_j \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big), \quad j = 1, \ldots, m, \tag{37}$$

where $Z(\xi)$ is the normalizing constant given by

$$Z(\xi) = \sum_{j=1}^{m} r_j \exp\Big(-\sum_{i=1}^{d} \xi_i t_i(j)\Big). \tag{38}$$

It is easy to see that estimating minimum entropy distributions can be translated to
solving polynomial equations when the feature functions are integer valued. With
$\theta_k = \exp(-\xi_k)$, the polynomial system one would solve in this case is

$$\sum_{j=1}^{m} r_j \big(t_i(j) - T_i\big) \prod_{k=1}^{d} \theta_k^{t_k(j)} = 0, \quad i = 1, \ldots, d. \tag{39}$$

Hence we have the following proposition.

Proposition 4.3. The estimation of the minimum entropy model (37) amounts to solving
a set of polynomial equations in the indeterminates $\theta_i = \exp(-\xi_i)$, $i = 1, \ldots, d$, provided
that the feature functions $t_i$, $i = 1, \ldots, d$, are positive and integer valued.

Since the estimation of ME-distributions involves solving a system of nonlinear
equations, which can be computationally inefficient, one would employ an iterative method
in which the distribution is estimated considering only one constraint at a time. We describe
this procedure as follows.

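At the $N$th iteration, the algorithm computes the distribution $p^{(N)}$ which minimizes
$I(p^{(N)} \| p^{(N-1)})$ with respect to the $i$th constraint, $1 \leq i \leq d$, where $N = ad + i$ for a nonnegative
integer $a$. In this iterative procedure we have $p^{(0)} = r$, and $p^{(1)}$ is given by

$$p_j^{(1)} = r_j\, (Z^{(1)})^{-1}\, \zeta_1^{t_1(j)}, \quad j = 1, \ldots, m,$$

where $Z^{(1)} = \sum_{j=1}^{m} r_j\, \zeta_1^{t_1(j)}$. Considering the first constraint in (10), $p^{(1)}$ can be estimated
by solving the polynomial equation

$$\sum_{j=1}^{m} r_j \big(t_1(j) - T_1\big)\, \zeta_1^{t_1(j)} = 0 \tag{40}$$

with indeterminate $\zeta_1$. Similarly we have

$$p_j^{(2)} = r_j\, (Z^{(1)})^{-1} (Z^{(2)})^{-1}\, \zeta_1^{t_1(j)} \zeta_2^{t_2(j)}, \quad j = 1, \ldots, m,$$

where $Z^{(2)} = \sum_{j=1}^{m} p_j^{(1)}\, \zeta_2^{t_2(j)}$. Considering the first two constraints in (10), the ME
distribution can be estimated by solving

$$\sum_{j=1}^{m} r_j \big(t_2(j) - T_2\big)\, \zeta_1^{t_1(j)} \zeta_2^{t_2(j)} = 0 \tag{41}$$

along with (40).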

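In general, when $N = ad + i$ for some nonnegative integer $a$, $p_j^{(N)}$, for $N = 1, 2, \ldots$, is
given by

$$p_j^{(N)} = r_j\, (Z^{(1)})^{-1} \cdots (Z^{(N)})^{-1}\, \zeta_1^{t_1(j)} \cdots \zeta_N^{t_i(j)} \tag{42}$$

and is determined by the following polynomial equation, taken together with the equations
obtained at the previous iterations:

$$\sum_{j=1}^{m} r_j \big(t_i(j) - T_i\big)\, \zeta_1^{t_1(j)} \cdots \zeta_N^{t_i(j)} = 0.$$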
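
The iteration described in this subsection can be sketched numerically as follows. At each step one univariate equation of the form (40), with the current distribution in place of $r$, is solved for a single indeterminate $\zeta$, and the distribution is rescaled accordingly. The constraint functions, the prescribed values and the number of sweeps are hypothetical, and bisection replaces the algebraic solution of each univariate equation; this is a minimal sketch under those assumptions rather than a definitive implementation.

```python
def tilt(p, t_i, T_i, lo=1e-6, hi=1e6, iters=200):
    """One step: find zeta > 0 with sum_j p_j (t_i(j) - T_i) zeta^t_i(j) = 0, then rescale p."""
    h = lambda z: sum(p_j * (t_j - T_i) * z ** t_j for p_j, t_j in zip(p, t_i))
    for _ in range(iters):
        mid = (lo * hi) ** 0.5        # geometric bisection; h changes sign from - to +
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    zeta = (lo * hi) ** 0.5
    w = [p_j * zeta ** t_j for p_j, t_j in zip(p, t_i)]
    z = sum(w)
    return [w_j / z for w_j in w]

def kullback_csiszar(r, t, T, sweeps=100):
    """Cycle through the d constraints, enforcing one per iteration, starting from p^(0) = r."""
    p = list(r)
    for _ in range(sweeps):
        for t_i, T_i in zip(t, T):
            p = tilt(p, t_i, T_i)
    return p

# Hypothetical example on [m] = {1, 2, 3, 4} with a uniform prior r.
r = [0.25] * 4
t = [[1, 2, 3, 4],
     [1, 0, 0, 1]]
T = [3.0, 0.5]    # moments of (0.1, 0.2, 0.3, 0.4), so the constraints are feasible

p = kullback_csiszar(r, t, T)
moments = [sum(t_j * p_j for t_j, p_j in zip(t_i, p)) for t_i in t]
print(p)
print(moments)
```

Each step requires that the prescribed value lies strictly between the smallest and largest value of the corresponding constraint function, so that the univariate equation has a positive root.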

5. Conclusion and Directions for Future Research

In this paper we attempted to describe maximum (and hence minimum) entropy models
in the algebraic statistical framework. We showed that maximum entropy models are toric
models when the constraint functions are assumed to be integer valued, and that
the set of statistical models resulting from the ME-principle is indeed a variety. In the dual
estimation we demonstrated that when the information is in the form of empirical means,
the calculation of ME-models can be transformed to solving a set of Laurent polynomial
equations. Work on computational algebraic algorithms for estimating ME-models is
in progress. We hope that this will also shed light on possible interesting algebraic
structures in information theoretic statistics.